ABSTRACT

We present an approach to extract measured information
from text (e.g., a 1370 °C melting point, a BMI greater
than 29.9 kg/m²). Such extractions are critically important
across a wide range of domains, especially those involving
search and exploration of scientific and technical documents.
We first propose a rule-based entity extractor to
mine measured quantities (i.e., a numeric value paired with
a measurement unit), which supports a comprehensive
set of both common and obscure measurement units.
Our method is robust and can recover valid
measured quantities even when significant errors are introduced
through the process of converting document formats
like PDF to plain text. Next, we describe an approach to
extracting the properties being measured (e.g., the property
pixel pitch in the phrase “a pixel pitch as high as 352 μm”).
Finally, we present MQSearch: the realization of a search
engine with full support for measured information.

Categories and Subject Descriptors 

I.2.7 [Artificial Intelligence]: Natural Language Processing - Text Analysis;
H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval - Search Process

Keywords 

text mining, information retrieval, information extraction, 
measured quantities, numerical queries 

1. INTRODUCTION AND MOTIVATION 

Scientific and technical documents describe methods and
results using measured quantities: a numeric value paired
with a unit of measurement. Examples of text snippets containing
such measured quantities include:

• average gravity curvature = (1.3999±0.003)×10⁻⁵ s⁻² m⁻¹

• 12 °C melting point

• distance from Earth to the Sun is 9.3 × 10⁷ miles

• average responsivity as low as 6.2 pA/K

Note that these measured quantities (e.g., 6.2 pA/K) are
typically associated with a specific measured property (e.g., average
responsivity). In this paper, we study how
to extract these kinds of measured information from documents.¹
The mining of such information is critically important
across many domains, especially those involving
search and exploration of scientific and technical articles.
For instance, an optics researcher may wish to know if the
performance of Nd:YAG laser-pumped KTP parametric oscillators
has ever been tested at wavelengths longer than
2.4 μm. Full-text search engines using inverted indexes allow
ad hoc queries on terms such as “KTP parametric oscillator”,
but the ability to further filter search results based
on wavelengths greater than 2.4 μm is not typically supported.
To accomplish this, one must first identify and extract
valid measured quantities (e.g., 2.4 μm) in unstructured
text and then identify and extract the properties being
measured (e.g., wavelength). These extractions could
then be stored in the index of a search engine in a way that
supports subsequent document queries on measured information
(e.g., faceted navigation, numeric range queries).

¹ We define measured information as measured quantities and
the measured properties with which they are associated.

Surprisingly, there is very little existing work on how best
to realize this process. Lines of research most closely related
to the present work include extracting numerical attributes
(e.g., [1][4]), supporting numerical document queries
(e.g., [5][12]), and formula identification. However,
none of these existing works addresses the comprehensive extraction
of and search for measured information in document
data, as described above. Indeed, numerous challenges exist
in such scenarios. Many widely used full-text search engines
(e.g., Apache Solr) convert the original document format
to plain text prior to indexing and storage, an extremely
error-ridden process. For instance, in the extracted text,
exponents are typically lost (e.g., 10⁵ becomes 105, s⁻² becomes
s-2). Moreover, the conversion of some characters can
be highly inconsistent and unpredictable. A simple minus
sign can be converted to a range of different dash characters
or even “garbage” characters. The same is true for other
symbols such as μ, multiplication signs, and degree symbols.
It is this inconsistent and error-ridden text, then, that
is ultimately stored in the index of the search engine, making
it virtually impossible to adequately locate documents
by measured quantities. Without the correct identification
of measured quantities, it is also virtually impossible to identify
the properties being measured, which are critical in efficiently
navigating scientific and technical articles for state-of-the-art
information. In general, there is a great deal of heterogeneity
in how measured quantities and measured properties
appear in text, both naturally and through corruption.
This, then, motivates the current investigation of how best
to extract such information.

Recent studies [3] have revealed that rule-based approaches
to information extraction tend to be more effective,
interpretable, and customizable than state-of-the-art
machine learning approaches. We therefore employ rule-based
extraction methods in this work. Our contributions are as follows:

• We propose and describe a rule-based entity extractor to
identify measured quantities in unstructured text documents.
Our method includes an error-correcting procedure
that recovers from the aforementioned text conversion
errors by 1) reverse engineering the corrupted and mangled
measured quantities back to their original, correct
form and 2) standardizing this form for storage in an
inverted index and subsequent query processing.

• Using these extracted measured quantities, we show how
to further extract the measured properties with which they
are associated.

• Finally, we present MQSearch: the realization of a
search engine with full support for measured information.
MQSearch is a facet-based navigation system that
allows users to navigate large document sets based on
measured quantities, measured properties, and the topics
and themes with which they are associated. To the best
of our knowledge, no other search engine in existence
fully supports such a capability.

We begin by describing the extraction of measured quantities.


2. MEASURED QUANTITIES 

We view measured quantities as a 5-tuple of the form:
(sign, number, error, scientific notation, units), where number
and units are mandatory and the other elements are optional.
As an example, a team of researchers in Italy recently reported
the first direct measurement of gravity’s curvature
as (1.3999 ± 0.003) × 10⁻⁵ s⁻² m⁻¹ [10]. The corresponding
5-tuple representation of this² is:

(<empty>, 1.3999, 0.003, 10⁻⁵, s⁻² m⁻¹)
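As a rough illustration of the 5-tuple, the sketch below stores the five elements in a small Python record. The class name, field names, and the plain-text encoding of exponents are hypothetical choices made for this example only, and treating number and units as the mandatory elements is an assumption based on the definition of a measured quantity given earlier.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MeasuredQuantity:
    """Hypothetical container mirroring the 5-tuple; names are illustrative."""

    number: str                          # assumed mandatory, e.g. "1.3999"
    units: str                           # assumed mandatory, e.g. "s^-2 m^-1"
    sign: Optional[str] = None           # optional, "+" or "-"
    error: Optional[str] = None          # optional, e.g. "0.003"
    sci_notation: Optional[str] = None   # optional, e.g. "10^-5"

    def as_tuple(self) -> Tuple[Optional[str], str, Optional[str], Optional[str], str]:
        """Return the fields in the order used in the text."""
        return (self.sign, self.number, self.error, self.sci_notation, self.units)


# The gravity-curvature example; the missing sign is represented by None.
gravity = MeasuredQuantity(number="1.3999", units="s^-2 m^-1",
                           error="0.003", sci_notation="10^-5")
```

Keeping the optional elements as `None` rather than empty strings makes it easy to distinguish "not present" from "present but empty" when rules populate the record.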

5-tuples such as this are populated using a series of extraction
rules that operate on individual sentences. These rules
fall into four broad categories: 1) pre-processing, 2) units,
3) quantities, and 4) post-processing. Simplified forms of
some of the rules for units and quantities are shown in
Table 1.³ We refer to the algorithm implementing such rules
as Measured Quantity Extractor or MQE. We begin with
pre-processing rules.

² Since there is no explicit sign in this example, the first
element is left empty.

³ Rules are shown in Perl-like syntax, the de facto standard
for regular expressions.

Pre-Processing. As mentioned previously, when extracting
text from various document formats (e.g., PDF, MS Office),
characters often appear inconsistently. Minus signs,
multiplication signs (e.g., ×, ·), equal-like symbols (e.g., ≈),
degree symbols, and the μ character can appear in
a variety of ways or, in some cases, as “garbage” characters.
For instance, minus can appear as the en dash character or
appear corrupted as â€. Pre-processing rules identify these
variations in text and perform the necessary normalization
for accurate extraction of units and quantities.
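A minimal sketch of this kind of normalization is shown below. The replacement table is an assumption made for illustration (it is neither exhaustive nor the actual MQE rule set), and the function name `normalize` is hypothetical. It maps a few dash variants and one common encoding corruption of the en dash to an ASCII minus, and unifies multiplication, micro, and degree characters.

```python
# Assumed, non-exhaustive replacement table; longer corrupted forms come first
# so that they are replaced before their shorter prefixes.
REPLACEMENTS = [
    ("\u00e2\u20ac\u201c", "-"),   # en dash decoded with the wrong codec
    ("\u00e2\u20ac", "-"),         # truncated form of the same corruption
    ("\u2212", "-"),               # true minus sign
    ("\u2013", "-"),               # en dash
    ("\u00d7", "x"),               # multiplication sign
    ("\u22c5", "\u00b7"),          # dot operator -> middle dot
    ("\u00b5", "\u03bc"),          # micro sign -> Greek small letter mu
    ("\u00ba", "\u00b0"),          # masculine ordinal used as a degree symbol
]


def normalize(text: str) -> str:
    """Apply every replacement in order and return the normalized text."""
    for bad, good in REPLACEMENTS:
        text = text.replace(bad, good)
    return text
```

With this table, a corrupted unit such as m followed by the two garbage characters and a 1 becomes `m-1`, which a unit rule can then interpret as an exponent.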

Units. A measurement unit preceded by a numeric string
conforming to the 5-tuple structure is the base indicator of
a measured quantity. Thus, to identify valid measured quantities,
we require a comprehensive ontology of units. We
obtained an initial units ontology from the OBO Foundry,⁴
but this was quite incomplete. We then expanded the ontology
using largely public sources (e.g., convert-me.com,
DoD technical reports, Physical Review, Nature Communications).
Each unit has an associated rule. An example
rule for m (i.e., the symbol for meters) is shown in Rule 5 of
Table 1. Note that such rules include optional prefixes for
submultiples and multiples (e.g., μ before m, kilo before meter).
Unit rules, when combined with the pre-processing rules
described previously, can accurately extract units under a
range of noisy conditions. For instance, the corrupted unit
mâ€1 is correctly recovered as m⁻¹ by MQE. Finally, as
shown in Rules 6 and 7, compound units are also supported
(e.g., km/h, kilometer per hour, s⁻² · m⁻¹).
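The following sketch illustrates a unit rule for the meter symbol in the spirit of Rule 5. It is a simplified stand-in written for this example, not the MQE rule itself: the pattern, the function name `normalize_meter_unit`, and the choice of `m^n` as the normalized output form are assumptions. It expects text that has already been through pre-processing, so exponent signs are plain ASCII hyphens.

```python
import re
from typing import Optional

# Optional SI prefix (f, p, n, mu, m, c, d, k), the symbol "m", and an optional
# exponent written with or without a caret: 2..6 or -1..-6.
UNIT_M = re.compile(r"^(?P<prefix>[fpn\u03bcmcdk])?m(?:\^?(?P<exp>[2-6]|-[1-6]))?$")


def normalize_meter_unit(token: str) -> Optional[str]:
    """Return a normalized meter unit such as 'cm^2', or None if no match."""
    match = UNIT_M.match(token)
    if match is None:
        return None
    base = (match.group("prefix") or "") + "m"
    exponent = match.group("exp")
    return f"{base}^{exponent}" if exponent else base
```

Because the prefix group is optional, the regex engine backtracks on a token like `m-1`: it first tries `m` as a milli prefix, fails to find a following `m`, and then matches it as the bare unit with exponent `-1`.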

Quantities. Like units, quantities (i.e., numbers with optional
error ranges and scientific notation) can appear in
a range of ways due to both corruptions and natural variation.
These variations are collectively captured by rules
such as those shown in Table 1 (i.e., Rules 1-4), which populate
the remainder of the 5-tuple structure. As shown in
Table 1, such rules capture a wide range of quantity formats
(e.g., 10,000 with a comma, 1.3999±0.003 × 10⁻⁵ with both
an error range and scientific notation, 1.23 × 105 with the
exponent lost from 10⁵). To support numeric range queries, extracted
quantities are standardized prior to storage in a search engine
index (e.g., the extracted quantity 1.3999±0.003 × 10⁻⁵
is stored simply as 0.000013999) [11].
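To make the standardization step concrete, the sketch below parses a quantity string and returns a single decimal value. The regular expression is a simplified assumption written for this example rather than the rules of Table 1, and the name `standardize` is hypothetical. Note one deliberate assumption: a trailing "x 105" is read as a lost exponent (10 to the 5th power), which mirrors the recovery behavior described above but would be wrong for text that genuinely means "times 105".

```python
import re
from decimal import Decimal
from typing import Optional

QUANTITY = re.compile(
    r"""^\s*\(?\s*
        (?P<number>[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+))   # value; commas allowed
        (?:\s*\u00b1\s*(?P<error>\d*\.?\d+))?           # optional error range
        \s*\)?
        (?:\s*(?:[eE]|[xX\u00d7]\s*10\s*\^?)\s*(?P<exp>[-+]?\d+))?  # optional exponent
        \s*$""",
    re.VERBOSE,
)


def standardize(text: str) -> Optional[Decimal]:
    """Return the quantity as one Decimal, dropping the error range."""
    match = QUANTITY.match(text)
    if match is None:
        return None
    value = Decimal(match.group("number").replace(",", ""))
    if match.group("exp") is not None:
        # Shift the decimal point exactly instead of multiplying floats.
        value = value.scaleb(int(match.group("exp")))
    return value
```

Using `Decimal` rather than `float` keeps the stored value exact, so the gravity-curvature example standardizes to exactly 0.000013999.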

Post-Processing. We have already seen that text extracted
from various document formats can be noisy. For instance,
information from tables, headers, and figures can sometimes
result in seemingly random sequences of numbers and letters
in extracted text. In some cases, such information can
erroneously be picked up by the aforementioned rules as measured
quantities. This is especially true for single-letter units
(e.g., m for meters, A for Ampere, etc.). Post-processing
rules are employed to reject such extractions and minimize
false positives. Examples of such rejection rules include
context-based rules (e.g., reject when preceded by “Table”
or “Figure”), repetition-based rules such as rejecting compound
units consisting of repeated single-letter units (e.g., 3
AJmm), and allowing a dash only between certain quantities
and units (e.g., 10-cm is okay but not 10-A).
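The three kinds of rejection rules mentioned above can be sketched as a single filter. Everything specific here is an assumption for illustration: the function name `is_false_positive`, the way a candidate is passed in (preceding word, list of unit parts, and a flag for a dash between number and unit), and the small whitelist of units allowed after a dash.

```python
from typing import List

# Hypothetical whitelist of units that may follow a dash, as in "10-cm".
DASH_OK_UNITS = {"cm", "mm", "km", "kg", "Hz"}
CONTEXT_WORDS = {"Table", "Figure"}


def is_false_positive(prev_word: str, unit_parts: List[str], dashed: bool = False) -> bool:
    """Return True when a candidate extraction should be rejected."""
    # Context-based rule: numbers right after "Table" or "Figure" are labels.
    if prev_word in CONTEXT_WORDS:
        return True
    # Repetition-based rule: a compound made only of single-letter units
    # in which some letter repeats (e.g., A, J, m, m).
    all_single = all(len(part) == 1 for part in unit_parts)
    if len(unit_parts) > 1 and all_single and len(set(unit_parts)) < len(unit_parts):
        return True
    # Dash rule: a dash between number and unit is allowed only for some units.
    if dashed and "".join(unit_parts) not in DASH_OK_UNITS:
        return True
    return False
```

For example, `is_false_positive("of", ["A", "J", "m", "m"])` rejects the "3 AJmm" case, while a dashed "10-cm" candidate passes.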

As we will show in Section 4, when used in combination,
these rules collectively enable highly accurate extraction of
measured quantities, which, in turn, can be exploited to
extract the properties being measured, as described next.


⁴ http://www.obofoundry.org/






Rule                        Pattern                                                                Example Matches

1) number                   [-+]?(\d((\d?\d?[,]\d{2,3}([,]\d{2,3})*)|\d*))(\.(\d[\d\s]*\d|\d))?    1000.05, +5, -0.2, and 1,000
2) number (leading point)   [-+]?\.\d(\d\d(\s\d{3})+(\s\d{1,3})?|\d*)                              -.98, .04, +.755
3) error                    (\s{0,2}±\s{0,2}[\d.]+)?                                               ±0.003 in “1.3999 ± 0.003”
4) sci. notation            ((\s*[eE]|\s*[xX×]\s*10 *\^?)[-+]?\d+)?                                forms of ×10ⁿ: x105, e-5, E-5
5) unit                     e.g., [fpnμmcdk]?m([\^]?[2-6]|[\-][1-6]); m# normalized to m^#         μm, m-1, cm2 (cm²), cm^2
6) connector                (\s?/\s?| [Pp]er |-per-|[-\s×·*])?                                     per, /, ·, ×
7) compound unit            <unit>(<connector><unit>)+                                             km/h, kilometer per hour, km·h⁻¹


Table 1: [MQE Rules.] Simplified forms of some rules for extraction of measured quantities.


Pattern                                  Example Matches (two examples shown for each rule)

NP SYM{0,2} EQ mq                        1) gravity curvature = 1.4 × 10⁻⁵ s⁻² m⁻¹    2) floor area ~ 32m²
mq IN? NP                                1) a 352 μm pixel pitch    2) 50mL of 30% fuming sulfuric acid
NP IN DT? NP VP+ (TO|IN|RB|JJ)* mq       1) strength of panel was set to 9 ksi    2) freq. of scans was roughly 300 Hz
NP (IN DT? NP)* VP+ (IN|TO|RB|JJ)* mq    1) pixel pitch employed was 352 μm.    2) panel strength was recorded at 9 ksi.
NP (CC|IN|TO|RB|JJ)* \(?mq\)?            1) wavelengths of at least 2.4 μm    2) panel strength (9 ksi)


Table 2: [MPE Rules.] Simplified forms of some syntactic patterns to extract measured properties.


3. MEASURED PROPERTIES 

We now turn our attention to the extraction of measured
properties. To better illustrate the problem, we show several
example snippets containing measured quantities. In each
example, the measured quantity is shown in blue, the property
being measured is highlighted in red, and the characters
connecting them are underlined:

• a pixel pitch as high as roughly 352 μm

• a 352 μm pixel pitch

• The pixel pitch employed was 352 μm.

• average gravity curvature = (1.3999±0.003) × 10⁻⁵ s⁻² m⁻¹

• with 50mL of 30% fuming sulfuric acid

• size = 0.1m²

• frequency of longitudinal scan was approximately 300 Hz.

• a nominal current density of 1.3 A/cm² to 0.03 A/cm²

• panel strength lower than 8.90 ksi (61.4 MPa)

• wavelengths at least 2.4 μm

• large fields of about, or above 10 kV/cm

From just the examples shown, it is easy to see that there
is an extremely high degree of variability in the words connecting
a measured property with a measured quantity. These
examples represent just a small sample of the many possible
variations. However, upon closer inspection, we find that
this variability can be reduced to a small number of syntactic
patterns based on part-of-speech (POS) tags that capture
most scenarios. Table 2 shows some syntactic patterns that
we employ to extract measured properties. We refer to the
extractor applying such syntactic rules as Measured Property
Extractor or MPE.

In Table 2, noun phrases shown in red (i.e., NP) are
extracted and taken as the measured property. Measured
quantities are represented in blue by mq. The EQ tag represents
all symbols related to ’=’ (e.g., ≈, ∼). The SYM tag
matches one- or two-character symbols (e.g., a Greek letter).
Other symbols (e.g., JJ, RB, IN, CC, VP) are part-of-speech
tags in Penn Treebank format. Note that tags such as RB
(i.e., an adverb) should be taken to include variations such
as the comparative and superlative forms. This is not explicitly
shown for reasons of brevity. This small set of patterns
matches a very wide range of possible phrase combinations
for measured properties, and the patterns are executed sequentially
in the order shown. We implemented MPE using the Brill
part-of-speech tagger [2]. As we will show in the next section,
our algorithms extract measured properties and measured
quantities with high accuracy, despite the aforementioned
issues with noisy and corrupted input text.
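One way to apply such syntactic patterns is to match regular expressions over the sequence of tags rather than over the words, and then map the matched span back to tokens. The sketch below does this for two of the patterns in Table 2. It is only an illustration under stated assumptions: the input is already tagged by hand (no tagger is invoked), a measured quantity is assumed to have been collapsed into one token tagged `MQ`, the noun phrase is approximated as a run of adjectives and nouns ending in a noun, and all names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Each tag is written followed by one space, so patterns consume whole tags.
NP = r"(?:(?:JJ|NNS|NN) )*(?:NNS|NN) "       # assumed noun-phrase approximation
LINK = r"(?:(?:CC|IN|TO|RB|JJ) )*"           # connecting words

# Patterns are tried in the order shown, as in the text.
PATTERNS = [
    re.compile(r"(?<!\S)MQ (?:IN )?(?P<np>" + NP + r")"),        # mq IN? NP
    re.compile(r"(?<!\S)(?P<np>" + NP + r")" + LINK + r"MQ "),   # NP (...)* mq
]


def extract_property(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Return the words of the first noun phrase matched by any pattern."""
    tags = "".join(tag + " " for _, tag in tagged)
    for pattern in PATTERNS:
        match = pattern.search(tags)
        if match:
            # Count spaces to convert character offsets into token indices.
            first = tags[:match.start("np")].count(" ")
            count = match.group("np").count(" ")
            return " ".join(word for word, _ in tagged[first:first + count])
    return None


# Hand-tagged examples from the list of snippets above.
after_quantity = [("a", "DT"), ("352 \u03bcm", "MQ"), ("pixel", "NN"), ("pitch", "NN")]
before_quantity = [("a", "DT"), ("pixel", "NN"), ("pitch", "NN"), ("as", "RB"),
                   ("high", "JJ"), ("as", "IN"), ("roughly", "RB"), ("352 \u03bcm", "MQ")]
```

Both example snippets yield the property "pixel pitch", even though the quantity appears before the noun phrase in one and after a chain of connecting words in the other.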

4. EXPERIMENTAL EVALUATION 

Since our research is sponsored by the U.S. Department
of Defense (DoD), we evaluate our approach on a text corpus
consisting of 40,807 unclassified research reports published
in the 2008-2010 time frame and hosted by the Defense
Technical Information Center (DTIC). This rich collection
describes a wide range of research funded by the DoD
spanning numerous fields from engineering and physical science
to biomedical research and social science. The DTIC
documents considered in this paper have been approved for
public release and unlimited distribution. All documents
are in PDF format, and text was extracted from them using
the pdftotext utility.⁵ From this collection, we generated
samples using the following procedure. To evaluate the
ability of MQE to extract measured quantities, we sampled
sentences uniformly at random from the population of all
sentences containing a numeric value. By examining sentences
with a number (but not necessarily a measurement unit),
we are able to identify false negatives in addition
to false positives. Next, to evaluate the ability of MPE to
extract measured properties, we generated a random sample
of sentences from the population of all sentences containing
a measured quantity, as identified by MQE. We employed
sample sizes of 1000 and 500 for MQE and MPE, respectively.
This produced sufficiently tight 95% confidence bounds on
our estimates for precision and recall over the entire corpus.
Different fields employ different measures in different ways.
By considering sentences sampled randomly in this fashion,
we are able to evaluate our methods on text data that capture
the diverse ways in which measured information is reported
across different fields. To the best of our knowledge,
no other approaches exist for extracting such measured information
from scientific and technical documents. Thus,
there are no appropriate baselines against which our methods
can be compared. Table 3 shows the precision and recall
estimates for both the measured quantity extractor and the
measured property extractor over the entire corpus.

⁵ http://www.foolabs.com/xpdf/home.html


Extractor    Precision       Recall

MQE          (0.93, 0.99)    (0.92, 0.99)
MPE          (0.93, 0.97)    (0.88, 0.94)


Table 3: 95% Confidence Intervals for precision and recall
when extracting measured quantities (using MQE) and
measured properties (using MPE) from the DTIC corpus.
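For readers who want to reproduce intervals of this kind from their own labeled sample, the sketch below computes a 95% Wilson score interval for a proportion such as precision or recall. This is one standard choice and is an assumption on our part: the text does not state which interval method was used, and the counts in the usage line are made up for illustration, not taken from the evaluation.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 96 correct extractions out of 100 judged.
low, high = wilson_interval(96, 100)
```

The Wilson interval behaves better than the plain normal approximation when the proportion is close to 1, which is the regime of interest for a high-precision extractor.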

As can be seen in the table, both MQE and MPE perform
well in extracting measured quantities and the properties
they describe from documents across disparate fields: the
lower confidence bounds are at least 0.93 for precision and
at least 0.88 for recall. Having demonstrated the success with which
measured information can be mined, we now demonstrate
how these extractions can be exploited in novel search applications.

5. AN APPLICATION: MQSEARCH 

Here, we present MQSearch: a realization of a search engine
with full support for measured information. MQSearch
is implemented using Apache Solr⁶ and AJAX Solr,⁷ both
of which support full-text search, faceted navigation, and
numeric range queries. During the process of indexing and
ingesting the DTIC document set into our search engine, we
apply our extractors to encountered text and store both measured
quantities and measured properties in the search engine
index. In addition, the search engine performs keyphrase extraction
on documents using the KERA algorithm described
in [8]. Using Solr filter queries, extracted keyphrases can be
used to produce a tag cloud for any subset of the document
set. Figure 1 shows the faceted navigation panel of
MQSearch, which allows users to filter documents based
on discovered measurement units, quantity ranges, and measured
properties. In Figure 1, the measurement unit U/mL
is selected. We see that there are 153 documents (out of
roughly 40,000) mentioning this unit with quantities ranging
from 0.001 U/mL to 10,000 U/mL. The property most
frequently measured in U/mL is penicillin. From the tag
cloud, we see that documents containing quantities measured
in U/mL tend to cover topics such as breast cancer
and prostate cancer research. The search results can be
filtered further along any of these dimensions. Filtering by
LDA-discovered topics is also supported but not shown in
the figure. To the best of our knowledge, ours is the first
search engine with such support for measured information.
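As a sketch of how such filtering could be expressed against a Solr index, the code below assembles request parameters that combine a full-text query with filter queries on a unit and a numeric range. The `q`, `fq`, `facet`, and `facet.field` parameters and the `field:[low TO high]` range syntax are standard Solr; the field names `mq_unit`, `mq_value`, and `mq_property` are hypothetical and do not come from the MQSearch schema.

```python
from typing import List, Tuple
from urllib.parse import urlencode


def build_params(terms: str, unit: str, low: str, high: str) -> List[Tuple[str, str]]:
    """Return Solr request parameters for a unit-restricted range query."""
    return [
        ("q", terms),                              # ordinary full-text query
        ("fq", f'mq_unit:"{unit}"'),               # hypothetical unit field
        ("fq", f"mq_value:[{low} TO {high}]"),     # hypothetical numeric field
        ("facet", "true"),
        ("facet.field", "mq_property"),            # hypothetical property facet
    ]


# Example: documents on KTP parametric oscillators at wavelengths of 2.4 um or more.
params = build_params('"KTP parametric oscillator"', "um", "2.4", "*")
query_string = urlencode(params)
```

Keeping the unit and the range in separate filter queries lets the search engine cache each filter independently, which suits faceted navigation where users toggle one dimension at a time.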

6. CONCLUSION 

In this paper, we have proposed an effective
approach to extracting measured information from unstructured
text data. We showed how to extract both measured
quantities and the properties being measured. We further
demonstrated how such extractions might be used in a
search engine for documents rich in measured information.
To the best of our knowledge, no other search engine in existence
supports such functionality. Our extraction methods
have the potential to substantially improve search, navigation,
and exploratory analysis of large or even massive collections
of scientific and technical articles. For future work,
we plan on combining our proposed approaches with other
well-studied techniques for exploratory search.

⁶ http://lucene.apache.org/solr/

⁷ https://github.com/evolvingweb/ajax-solr

[Figure 1 panels: Top Discovered Keywords (bone marrow, breast cancer, cancer cells,
cell death, cell line, cell lines, cell proliferation, cell responses, dendritic cells,
ex vivo, gene expression, human breast, mcf-7 cells, metastatic breast, nerve agent,
nerve agents, ovarian cancer, prostate cancer, stem cell, stem cells, transgenic mice,
tumor cells); Units; quantity range (to: 10000); measured properties
(penicillin [55], collagenase [5], il-4 [4])]

Figure 1: [MQSearch.] The measurement unit
U/mL is selected, which reveals the associated topics
(e.g., breast/prostate cancer), associated measured properties
(e.g., concentrations of penicillin), and associated quantity
ranges (i.e., 0.001 to 10,000).